Abstract



In this work, we construct a risk estimator for hard thresholding which can be used as a basis to solve the difficult task of automatically 
selecting the threshold. As hard thresholding is not even continuous, Stein's lemma cannot be used to get an unbiased estimator of degrees of 
freedom, hence of the risk. We prove that under a mild condition, our estimator of the degrees of freedom, although biased, is consistent. Numerical 
evidence shows that our estimator outperforms another biased risk estimator proposed in [1].
I. Introduction

We observe a realisation $y \in \mathbb{R}^P$ of the normal random vector $Y = x_0 + W$, $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$, so that $Y \sim \mathcal{N}(x_0, \sigma^2 \mathrm{Id}_P)$. Given an estimator $y \mapsto x(y, \lambda)$ of $x_0$ evaluated at $y$ and parameterized by $\lambda$, the associated Degree Of Freedom (DOF) is defined as [2]

$$ \mathrm{df}\{x\}(x_0, \lambda) \triangleq \sum_{i=1}^{P} \frac{\mathrm{cov}\big(Y_i, x(Y, \lambda)_i\big)}{\sigma^2} . \tag{1} $$



The DOF plays an important role in model/parameter selection. For instance, define the criterion

$$ \|Y - x(Y, \lambda)\|^2 - P\sigma^2 + 2\sigma^2\, \widehat{\mathrm{df}}\{x\}(Y, \lambda) . \tag{2} $$

In the rest, we denote by div the divergence operator. If $x(\cdot, \lambda)$ is weakly differentiable w.r.t. its first argument with an essentially bounded gradient, Stein's lemma [3] implies that $\widehat{\mathrm{df}}\{x\}(Y, \lambda) = \mathrm{div}\,(x(Y, \lambda))$ and (2) (the SURE in this case) are respectively unbiased estimates of $\mathrm{df}\{x\}(x_0, \lambda)$ and of the risk $E_W \|x(Y, \lambda) - x_0\|^2$. In practice, (2) relies solely on the realisation $y$, which is useful for selecting the $\lambda$ that minimizes (2).

In this paper, we focus on Hard Thresholding (HT) 

$$ y \mapsto \mathrm{HT}(y, \lambda)_i = \begin{cases} 0 & \text{if } |y_i| < \lambda , \\ y_i & \text{otherwise} . \end{cases} \tag{3} $$

HT is not even continuous, and Stein's lemma does not apply, so that $\mathrm{df}\{x\}(x_0, \lambda)$ and the risk cannot be unbiasedly estimated [1]. To overcome this difficulty, we build an estimator that, although biased, turns out to enjoy good asymptotic properties. In turn, this allows efficient selection of the threshold $\lambda$.

II. Stein Consistent Risk Estimator (SCORE)

Remark that the HT can be written as

$$ \mathrm{HT}(y, \lambda) = \mathrm{ST}(y, \lambda) + D(y, \lambda) $$

with

$$ \mathrm{ST}(y, \lambda)_i = \begin{cases} y_i + \lambda & \text{if } y_i < -\lambda \\ 0 & \text{if } -\lambda \le y_i < +\lambda \\ y_i - \lambda & \text{otherwise} \end{cases} \quad \text{and} \quad D(y, \lambda)_i = \begin{cases} -\lambda & \text{if } y_i < -\lambda \\ 0 & \text{if } -\lambda \le y_i < +\lambda \\ +\lambda & \text{otherwise} \end{cases} , $$

where $y \mapsto \mathrm{ST}(y, \lambda)$ is the soft thresholding operator. Soft thresholding is a Lipschitz continuous function of $y$ with an essentially bounded gradient, and therefore, appealing to Stein's lemma, an unbiased estimator of its DOF is given by $\widehat{\mathrm{df}}\{\mathrm{ST}\}(Y, \lambda) = \mathrm{div}\,\mathrm{ST}(Y, \lambda)$. This DOF estimate at a realization $y$ is known to be equal to $\#\{|y| > \lambda\}$, i.e., the number of entries of $|y|$ greater than $\lambda$ (see [4], [5]). The mapping $y \mapsto D(y, \lambda)$ is piece-wise constant with discontinuities at $\pm\lambda$, so that Stein's lemma does not apply to estimate the DOF of hard thresholding. To circumvent this difficulty, we instead propose an estimator of the DOF of a smoothed version, replacing $D(\cdot, \lambda)$ by $G_h * D(\cdot, \lambda)$, where $G_h$ is a Gaussian kernel of bandwidth $h > 0$ and $*$ is the convolution operator. In this case $G_h * D(\cdot, \lambda)$ is obviously $C^\infty$, and its DOF can be unbiasedly estimated as $\mathrm{div}\,(G_h * D(\cdot, \lambda))(Y)$. To reduce bias (this will be made clear from the proof), we have furthermore introduced a multiplicative constant, $\sqrt{\sigma^2 + h^2}/\sigma$, leading to the following DOF formula

$$ \widehat{\mathrm{df}}\{\mathrm{HT}\}(y, \lambda, h) = \#\{|y| > \lambda\} + \frac{\lambda \sqrt{\sigma^2 + h^2}}{\sqrt{2\pi}\, \sigma h} \sum_{i=1}^{P} \left[ \exp\left( -\frac{(y_i + \lambda)^2}{2h^2} \right) + \exp\left( -\frac{(y_i - \lambda)^2}{2h^2} \right) \right] . \tag{4} $$
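The decomposition HT = ST + D and the DOF formula (4) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function names `hard_threshold`, `soft_threshold` and `df_hat_ht`, as well as the toy values of `y`, `lam`, `sigma` and `h`, are assumptions made for the example.

```python
import numpy as np

def hard_threshold(y, lam):
    """HT(y, lam)_i = 0 if |y_i| < lam, y_i otherwise (eq. (3))."""
    return np.where(np.abs(y) < lam, 0.0, y)

def soft_threshold(y, lam):
    """ST(y, lam)_i = sign(y_i) * max(|y_i| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def df_hat_ht(y, lam, sigma, h):
    """Smoothed DOF estimate of hard thresholding, following eq. (4)."""
    # Unbiased DOF estimate of the soft-thresholding part: #{|y| > lam}
    count = np.count_nonzero(np.abs(y) > lam)
    # Divergence of the Gaussian-smoothed jump part D, rescaled by sqrt(sigma^2 + h^2) / sigma
    scale = lam * np.sqrt(sigma**2 + h**2) / (np.sqrt(2.0 * np.pi) * sigma * h)
    bumps = (np.exp(-(y + lam) ** 2 / (2.0 * h**2))
             + np.exp(-(y - lam) ** 2 / (2.0 * h**2)))
    return count + scale * np.sum(bumps)

# Toy values (assumptions for illustration only)
y = np.array([-3.0, -0.5, 0.2, 1.4, 2.5])
lam, sigma, h = 1.0, 1.0, 0.1

# D(y, lam) = HT - ST equals -lam below -lam, 0 inside, +lam above
d = hard_threshold(y, lam) - soft_threshold(y, lam)
df_estimate = df_hat_ht(y, lam, sigma, h)
```

The second term of `df_hat_ht` only receives noticeable contributions from entries of `y` lying within a few bandwidths `h` of the two discontinuities at `-lam` and `+lam`.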
We now give our two main results, proved in Section IV.



Algorithm: Risk estimation for Hard Thresholding

Inputs: observation $y \in \mathbb{R}^P$, threshold $\lambda > 0$
Parameters: noise variance $\sigma^2 > 0$
Output: solution $x^\star$

Initialize $h \leftarrow h(P)$
for all $\lambda$ in the tested range do
    Compute $x \leftarrow \mathrm{HT}(y, \lambda)$ using (3)
    Compute $\widehat{\mathrm{df}}\{\mathrm{HT}\}(y, \lambda, h)$ using (4)
    Compute SCORE at $y$ using (2)
end for
return $x^\star \leftarrow x$ that provides the smallest SCORE

Fig. 1. Pseudo-algorithm for HT with SCORE-based threshold optimization.

Fig. 2. Risk and its SCORE estimate with respect to the threshold $\lambda$.
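Theorem 1: Let $Y = x_0 + W$ for $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$. Take $h(P)$ such that $\lim_{P \to \infty} h(P) = 0$ and $\lim_{P \to \infty} P^{-1} h(P)^{-1} = 0$. Then $\operatorname{plim}_{P \to \infty} \frac{1}{P} \big( \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) - \mathrm{df}\{\mathrm{HT}\}(x_0, \lambda) \big) = 0$. In particular

1. $\lim_{P \to \infty} E_W \big[ \tfrac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \big] = \lim_{P \to \infty} \tfrac{1}{P} \mathrm{df}\{\mathrm{HT}\}(x_0, \lambda)$, and

2. $\lim_{P \to \infty} V_W \big[ \tfrac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \big] = 0$,

where $V_W$ is the variance w.r.t. $W$.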

We now turn to a straightforward corollary of this theorem. 



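Corollary 1: Let $Y = x_0 + W$ for $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$, and assume that $\|x_0\|_4 = o(P^{1/2})$. Take $h(P)$ such that $\lim_{P \to \infty} h(P) = 0$ and $\lim_{P \to \infty} P^{-1} h(P)^{-1} = 0$. Then, the Stein Consistent Risk Estimator (SCORE) evaluated at a realization $y$ of $Y$

$$ \mathrm{SCORE}\{\mathrm{HT}\}(y, \lambda, h(P)) = \sum_{i=1}^{P} \left( (y_i^2 - \sigma^2) + I(|y_i| > \lambda)(2\sigma^2 - y_i^2) + 2\sigma\lambda \frac{\sqrt{\sigma^2 + h(P)^2}}{\sqrt{2\pi}\, h(P)} \left[ \exp\left( -\frac{(y_i + \lambda)^2}{2h(P)^2} \right) + \exp\left( -\frac{(y_i - \lambda)^2}{2h(P)^2} \right) \right] \right) $$

is such that $\operatorname{plim}_{P \to \infty} \frac{1}{P} \big( \mathrm{SCORE}\{\mathrm{HT}\}(Y, \lambda, h(P)) - E_W \|\mathrm{HT}(Y, \lambda) - x_0\|^2 \big) = 0$.

Fig. 1 summarizes the pseudo-code when applying SCORE to automatically find the optimal threshold $\lambda$ that minimizes SCORE in a predefined (non-empty) range.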
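The loop of Fig. 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the helper names (`score_ht`, `select_threshold`), the sparse test signal, the grid of thresholds and the random seed are all hypothetical, and the bandwidth simply reuses the rule $h(P) = 6\sigma / P^{1/3}$ reported in the experiments.

```python
import numpy as np

def hard_threshold(y, lam):
    """HT(y, lam)_i = 0 if |y_i| < lam, y_i otherwise (eq. (3))."""
    return np.where(np.abs(y) < lam, 0.0, y)

def df_hat_ht(y, lam, sigma, h):
    """Smoothed DOF estimate of hard thresholding (eq. (4))."""
    count = np.count_nonzero(np.abs(y) > lam)
    scale = lam * np.sqrt(sigma**2 + h**2) / (np.sqrt(2.0 * np.pi) * sigma * h)
    bumps = (np.exp(-(y + lam) ** 2 / (2.0 * h**2))
             + np.exp(-(y - lam) ** 2 / (2.0 * h**2)))
    return count + scale * np.sum(bumps)

def score_ht(y, lam, sigma, h):
    """SCORE: criterion (2) evaluated with the DOF estimate (4)."""
    residual = y - hard_threshold(y, lam)
    return (np.sum(residual**2) - y.size * sigma**2
            + 2.0 * sigma**2 * df_hat_ht(y, lam, sigma, h))

def select_threshold(y, lambdas, sigma, h):
    """Return the HT estimate, the threshold minimizing SCORE, and all SCORE values."""
    scores = np.array([score_ht(y, lam, sigma, h) for lam in lambdas])
    best = int(np.argmin(scores))
    return hard_threshold(y, lambdas[best]), lambdas[best], scores

# Hypothetical test problem: 50 spikes of amplitude 5 in unit-variance noise
rng = np.random.default_rng(0)
P, sigma = 2000, 1.0
x0 = np.zeros(P)
x0[:50] = 5.0
y = x0 + sigma * rng.standard_normal(P)

h = 6.0 * sigma / P ** (1.0 / 3.0)      # bandwidth rule taken from the experiments
lambdas = np.linspace(0.5, 5.0, 46)     # tested range (an assumption)
x_star, lam_star, scores = select_threshold(y, lambdas, sigma, h)
```

Only `y`, `sigma` and the tested range enter the selection; `x0` is used here solely to build the synthetic observation.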

III. Experiments and conclusions 

Fig. 2 shows the evolution of the true risk, the SCORE and the risk estimator of [1] as a function of $\lambda$, where $x_0$ is a compressible vector of length $P = 2 \times 10^5$ whose sorted values in magnitude decay as $|x_0|_{(i)} = 1/i^{\gamma}$ for $\gamma > 0$, and we have chosen $\sigma$ such that the SNR of $y$ is about 5.65 dB and $h(P) = 6\sigma/P^{1/3} \approx \sigma/10$. The optimal $\lambda$ is found around the minimum of the true risk.
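A setup in the spirit of this experiment can be sketched as below. Several ingredients are assumptions, since the text does not specify them: the decay exponent `gamma = 0.7`, the random signs and ordering of the entries, the SNR convention $10 \log_{10}(\|x_0\|^2 / (P\sigma^2))$, the grid of thresholds and the seed. The squared error of a single realization is used as a stand-in for the true risk, and the estimator of [1] is not reproduced.

```python
import numpy as np

def score_ht(y, lam, sigma, h):
    """SCORE for hard thresholding, written with the per-entry formula of Corollary 1."""
    keep = np.abs(y) > lam
    bumps = (np.exp(-(y + lam) ** 2 / (2.0 * h**2))
             + np.exp(-(y - lam) ** 2 / (2.0 * h**2)))
    weight = 2.0 * sigma * lam * np.sqrt(sigma**2 + h**2) / (np.sqrt(2.0 * np.pi) * h)
    return np.sum((y**2 - sigma**2) + keep * (2.0 * sigma**2 - y**2) + weight * bumps)

rng = np.random.default_rng(0)
P, snr_db = 200_000, 5.65
gamma = 0.7                                     # hypothetical decay exponent

# Compressible signal: sorted magnitudes decay as 1 / i^gamma
magnitudes = 1.0 / np.arange(1, P + 1) ** gamma
x0 = rng.permutation(magnitudes * rng.choice([-1.0, 1.0], size=P))

# Noise level from the assumed SNR convention 10 log10(||x0||^2 / (P sigma^2))
sigma = np.sqrt(np.mean(x0**2) / 10.0 ** (snr_db / 10.0))
y = x0 + sigma * rng.standard_normal(P)

h = 6.0 * sigma / P ** (1.0 / 3.0)              # roughly sigma / 10 for P = 2e5

lambdas = sigma * np.linspace(0.5, 5.0, 46)     # tested range (an assumption)
score = np.array([score_ht(y, lam, sigma, h) for lam in lambdas])
loss = np.array([np.sum((np.where(np.abs(y) < lam, 0.0, y) - x0) ** 2)
                 for lam in lambdas])

lam_score = lambdas[np.argmin(score)]           # threshold selected by SCORE
lam_oracle = lambdas[np.argmin(loss)]           # threshold minimizing the realized error
```

Plotting `score / P` and `loss / P` against `lambdas` gives curves analogous to those described for Fig. 2, up to the assumptions listed above.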

Future work will concern a deeper investigation of the choice of h(P), comparison with other biased risk estimators, and extensions to 
other non-continuous estimators and inverse problems. 

IV. Proof 



We first derive a closed-form expression for the DOF of HT. 

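Lemma 1: Let $Y = x_0 + W$ where $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$. The DOF of HT is given by

$$ \mathrm{df}\{\mathrm{HT}\}(x_0, \lambda) = P - \frac{1}{2} \sum_{i=1}^{P} \left[ \operatorname{erf}\left( \frac{(x_0)_i + \lambda}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{(x_0)_i - \lambda}{\sqrt{2}\sigma} \right) \right] + \frac{\lambda}{\sqrt{2\pi}\sigma} \sum_{i=1}^{P} \left[ \exp\left( -\frac{((x_0)_i + \lambda)^2}{2\sigma^2} \right) + \exp\left( -\frac{((x_0)_i - \lambda)^2}{2\sigma^2} \right) \right] . \tag{5} $$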



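Proof: According to [1], we have

$$ \mathrm{df}\{\mathrm{HT}\}(x_0, \lambda) = E_W\big[ \#\{|Y| > \lambda\} \big] + \frac{\lambda}{\sigma^2} E_W \left[ \sum_{i=1}^{P} \operatorname{sign}(Y_i)\, W_i\, I(|Y_i| > \lambda) \right] , $$

where $\operatorname{sign}(\cdot)$ is the sign function and $I(\omega)$ is the indicator for an event $\omega$. Integrating w.r.t. the zero-mean Gaussian density of variance $\sigma^2$ yields the closed form of the expectation terms. ■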
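As a sanity check, the closed form (5) can be compared with a Monte Carlo estimate of the covariance definition (1). The sketch below is illustrative only: the function names, the small vector `x0`, the number of draws and the seed are assumptions, and `math.erf` is vectorized with `np.vectorize` to keep the example free of extra dependencies.

```python
import math
import numpy as np

def df_ht_closed_form(x0, lam, sigma):
    """Closed-form DOF of hard thresholding, following eq. (5)."""
    erf = np.vectorize(math.erf)
    s = math.sqrt(2.0) * sigma
    # P(|Y_i| < lam) for Y_i ~ N((x0)_i, sigma^2)
    prob_inside = 0.5 * (erf((x0 + lam) / s) - erf((x0 - lam) / s))
    gauss = (np.exp(-(x0 + lam) ** 2 / (2.0 * sigma**2))
             + np.exp(-(x0 - lam) ** 2 / (2.0 * sigma**2)))
    return (x0.size - np.sum(prob_inside)
            + lam / (math.sqrt(2.0 * math.pi) * sigma) * np.sum(gauss))

def df_ht_monte_carlo(x0, lam, sigma, n_draws, rng):
    """Monte Carlo estimate of sum_i cov(Y_i, HT(Y, lam)_i) / sigma^2 (eq. (1))."""
    w = sigma * rng.standard_normal((n_draws, x0.size))
    y = x0 + w
    ht = np.where(np.abs(y) < lam, 0.0, y)
    # Since E[W_i] = 0, cov(Y_i, HT_i) = E[W_i * HT_i]
    return np.sum(np.mean(w * ht, axis=0)) / sigma**2

# Hypothetical small example
rng = np.random.default_rng(0)
x0 = np.array([0.0, 0.5, 1.5, 3.0])
lam, sigma = 1.5, 1.0

df_exact = df_ht_closed_form(x0, lam, sigma)
df_mc = df_ht_monte_carlo(x0, lam, sigma, n_draws=200_000, rng=rng)
```

With `lam = 0` the closed form reduces to `x0.size`, as expected for the identity estimator, and for large `lam` it tends to zero.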
We now turn to the proof of our theorem. 

Proof: The first part of (5) corresponds to $E_W[\#\{|Y| > \lambda\}]$, and can then be obviously unbiasedly estimated from an observation $y$ by $\#\{|y| > \lambda\}$. Let $A$ be the function defined, for $(t, a) \in \mathbb{R}^2$, by

$$ A(t, a) = \frac{\sqrt{\sigma^2 + h^2}}{h} \exp\left( -\frac{(t - a)^2}{2h^2} \right) . $$



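By classical convolution properties of Gaussians, we have

$$ E_{W_i}[A(Y_i, a)] = \exp\left( -\frac{((x_0)_i - a)^2}{2(\sigma^2 + h^2)} \right) , $$

and

$$ V_{W_i}[A(Y_i, a)] = \frac{\sigma^2 + h^2}{h \sqrt{2\sigma^2 + h^2}} \exp\left( -\frac{((x_0)_i - a)^2}{2\sigma^2 + h^2} \right) - \exp\left( -\frac{((x_0)_i - a)^2}{\sigma^2 + h^2} \right) . $$

Taking $h = h(P)$ and assuming $\lim_{P \to \infty} h(P) = 0$ shows that

$$ \lim_{P \to \infty} E_W \left[ \frac{1}{P} \sum_{i=1}^{P} A(Y_i, a) \right] = \lim_{P \to \infty} \frac{1}{P} \sum_{i=1}^{P} \exp\left( -\frac{((x_0)_i - a)^2}{2\sigma^2} \right) . $$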
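The Gaussian identities for the mean and variance of $A(Y_i, a)$ can be checked numerically. The sketch below compares a plain Riemann-sum quadrature with the closed forms, for $Y \sim \mathcal{N}(x, \sigma^2)$; the function names, the grid size and the test values of `x`, `a`, `sigma` and `h` are assumptions chosen for illustration.

```python
import numpy as np

def kernel_A(t, a, sigma, h):
    """A(t, a) = sqrt(sigma^2 + h^2) / h * exp(-(t - a)^2 / (2 h^2))."""
    return np.sqrt(sigma**2 + h**2) / h * np.exp(-(t - a) ** 2 / (2.0 * h**2))

def numerical_moments(x, a, sigma, h, n_grid=400_001):
    """Riemann-sum approximation of E[A(Y, a)] and V[A(Y, a)] for Y ~ N(x, sigma^2)."""
    t = np.linspace(x - 10.0 * sigma, x + 10.0 * sigma, n_grid)
    dt = t[1] - t[0]
    density = np.exp(-(t - x) ** 2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    values = kernel_A(t, a, sigma, h)
    mean = np.sum(values * density) * dt
    second = np.sum(values**2 * density) * dt
    return mean, second - mean**2

def closed_form_moments(x, a, sigma, h):
    """Closed-form mean and variance of A(Y, a) obtained by Gaussian convolution."""
    mean = np.exp(-(x - a) ** 2 / (2.0 * (sigma**2 + h**2)))
    second = ((sigma**2 + h**2) / (h * np.sqrt(2.0 * sigma**2 + h**2))
              * np.exp(-(x - a) ** 2 / (2.0 * sigma**2 + h**2)))
    return mean, second - mean**2

# Hypothetical test values
x, a, sigma, h = 0.5, 1.5, 1.0, 0.1
mean_num, var_num = numerical_moments(x, a, sigma, h)
mean_cf, var_cf = closed_form_moments(x, a, sigma, h)
```

For small `h` the variance grows like $\sigma / (\sqrt{2} h)$ times a Gaussian factor, which is why the condition $P^{-1} h(P)^{-1} \to 0$ is needed to make the variance of the average over $P$ entries vanish.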



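Since from (4), we have

$$ \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h) = \#\{|Y| > \lambda\} + \frac{\lambda}{\sqrt{2\pi}\sigma} \sum_{i=1}^{P} \big[ A(Y_i, \lambda) + A(Y_i, -\lambda) \big] , $$

and using Lemma 1, statement 1. follows.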

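For statement 2., the Cauchy-Schwarz inequality implies that

$$ V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h) \right]^{1/2} \le \frac{V_W\big[ \#\{|Y| > \lambda\} \big]^{1/2}}{P} + \frac{\lambda}{\sqrt{2\pi}\sigma} \left( V_W \left[ \frac{1}{P} \sum_{i=1}^{P} A(Y_i, \lambda) \right]^{1/2} + V_W \left[ \frac{1}{P} \sum_{i=1}^{P} A(Y_i, -\lambda) \right]^{1/2} \right) . $$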



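$\#\{|Y| > \lambda\} \sim \mathrm{Bin}(P, 1 - p)$, whose variance is $P p (1 - p)$, where $p = \frac{1}{2} \left( \operatorname{erf}\left( \frac{(x_0)_i + \lambda}{\sqrt{2}\sigma} \right) - \operatorname{erf}\left( \frac{(x_0)_i - \lambda}{\sqrt{2}\sigma} \right) \right)$ (when $p$ depends on $i$, the variance is $\sum_i p_i (1 - p_i) \le P/4$ and the conclusion is unchanged). It follows that

$$ \lim_{P \to \infty} \frac{1}{P^2} V_W\big[ \#\{|Y| > \lambda\} \big] = 0 . $$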



Taking again $h = h(P)$ with $\lim_{P \to \infty} h(P) = 0$ and $\lim_{P \to \infty} P^{-1} h(P)^{-1} = 0$ yields

$$ \lim_{P \to \infty} V_W \left[ \frac{1}{P} \sum_{i=1}^{P} A(Y_i, a) \right] = \lim_{P \to \infty} \frac{1}{P^2} \sum_{i=1}^{P} V_W[A(Y_i, a)] = 0 , $$

where we used the fact that the random variables $Y_i$ are uncorrelated. This establishes 2. Consistency (i.e. convergence in probability) follows from traditional arguments by invoking Chebyshev's inequality and using the asymptotic unbiasedness and vanishing variance established in 1. and 2. ■
Let us now prove the corollary. 

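Proof: By assumption, $\lim_{P \to \infty} h(P) = 0$. Thus, by virtue of statement 1. of Theorem 1, specializing (2) to the case of HT gives

$$ \begin{aligned} \lim_{P \to \infty} E_W \left[ \frac{1}{P} \mathrm{SCORE}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right] &= \lim_{P \to \infty} \frac{1}{P} E_W \left[ \|Y - \mathrm{HT}(Y, \lambda)\|^2 - P\sigma^2 + 2\sigma^2\, \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right] \\ &= \lim_{P \to \infty} \frac{1}{P} E_W \|Y - \mathrm{HT}(Y, \lambda)\|^2 - \sigma^2 + 2\sigma^2 \lim_{P \to \infty} \frac{1}{P} E_W\, \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \\ &= \lim_{P \to \infty} \frac{1}{P} E_W \|Y - \mathrm{HT}(Y, \lambda)\|^2 - \sigma^2 + 2\sigma^2 \lim_{P \to \infty} \frac{1}{P} \mathrm{df}\{\mathrm{HT}\}(x_0, \lambda) \\ &= \lim_{P \to \infty} \frac{1}{P} E_W \|\mathrm{HT}(Y, \lambda) - x_0\|^2 , \end{aligned} $$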



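where we used the fact that all the limits of the expectations are finite. The Cauchy-Schwarz inequality again yields

$$ \begin{aligned} V_W \left[ \frac{1}{P} \mathrm{SCORE}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} &\le V_W \left[ \frac{1}{P} \|Y - \mathrm{HT}(Y, \lambda)\|^2 \right]^{1/2} + 2\sigma^2\, V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} \\ &= \frac{1}{P} \left( \sum_{i=1}^{P} V_W\big[ |Y_i|^2 I(|Y_i| < \lambda) \big] \right)^{1/2} + 2\sigma^2\, V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} \\ &\le \frac{1}{P} \left( \sum_{i=1}^{P} E_W |Y_i|^4 \right)^{1/2} + 2\sigma^2\, V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} \\ &= \left( \frac{\|x_0\|_4^4}{P^2} + \frac{6\sigma^2 \|x_0\|_2^2}{P^2} + \frac{3\sigma^4}{P} \right)^{1/2} + 2\sigma^2\, V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} \\ &\le \left( \frac{\|x_0\|_4^4}{P^2} + \frac{6\sigma^2 \|x_0\|_4^2}{P^{3/2}} + \frac{3\sigma^4}{P} \right)^{1/2} + 2\sigma^2\, V_W \left[ \frac{1}{P} \widehat{\mathrm{df}}\{\mathrm{HT}\}(Y, \lambda, h(P)) \right]^{1/2} , \end{aligned} $$

where the fourth line uses $E_W |Y_i|^4 = (x_0)_i^4 + 6\sigma^2 (x_0)_i^2 + 3\sigma^4$ and the last line uses $\|x_0\|_2^2 \le \sqrt{P}\, \|x_0\|_4^2$.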



As by assumption $\lim_{P \to \infty} P^{-1} h(P)^{-1} = 0$ and $\|x_0\|_4 = o(P^{1/2})$, the variance of $\frac{1}{P} \mathrm{SCORE}$ vanishes as $P \to \infty$. We conclude using the same convergence in probability arguments used at the end of the proof of Theorem 1. ■